The Kaggle Workbook by Konrad Banachewicz & Luca Massaron

Author: Konrad Banachewicz & Luca Massaron
Language: eng
Format: epub
Publisher: Packt
Published: 2023-12-15T00:00:00+00:00


In addition, the decimal part of the price is extracted as a feature, in order to reveal when an item is sold at a psychological pricing threshold (e.g., $19.99 or £2.98; see this discussion: https://www.kaggle.com/competitions/m5-forecasting-accuracy/discussion/145011).

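The function math.modf (https://docs.python.org/3.8/library/math.html#math.modf) helps here because it splits any floating-point number into its fractional and integer parts, returned as a two-item tuple in that order. Both parts are floats, so the first element of the tuple is the decimal part of the price.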
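As a minimal sketch of the splitting step, the snippet below applies math.modf to a few example prices. The helper name cents_part and the rounding to two decimals are assumptions added for illustration; the function in this section keeps the raw fractional part without rounding.

```python
import math


def cents_part(price: float) -> float:
    """Return the fractional part of a price, rounded to whole cents."""
    fractional, _integer = math.modf(price)
    # Rounding hides binary floating-point noise, since values such as
    # 0.99 cannot be represented exactly as floats.
    return round(fractional, 2)


for price in (19.99, 2.98, 5.0):
    # math.modf returns (fractional part, integer part), both as floats.
    print(price, math.modf(price), cents_part(price))
```

Prices ending in .99 or .98 then stand out as a distinct value of the feature, while round prices map to 0.0.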

Finally, the resulting table is saved to disk.

Here is the function doing all the feature engineering on prices:

def generate_grid_price(prices_df, calendar_df, end_train_day_x, predict_horizon):
    grid_df = pd.read_feather(f"grid_df_{end_train_day_x}_to_{end_train_day_x + predict_horizon}.feather")
    prices_df['price_max'] = prices_df.groupby(['store_id', 'item_id'])['sell_price'].transform('max')
    prices_df['price_min'] = prices_df.groupby(['store_id', 'item_id'])['sell_price'].transform('min')
    prices_df['price_std'] = prices_df.groupby(['store_id', 'item_id'])['sell_price'].transform('std')
    prices_df['price_mean'] = prices_df.groupby(['store_id', 'item_id'])['sell_price'].transform('mean')
    prices_df['price_norm'] = prices_df['sell_price'] / prices_df['price_max']
    prices_df['price_nunique'] = prices_df.groupby(['store_id', 'item_id'])['sell_price'].transform('nunique')
    prices_df['item_nunique'] = prices_df.groupby(['store_id', 'sell_price'])['item_id'].transform('nunique')
    calendar_prices = calendar_df[['wm_yr_wk', 'month', 'year']]
    calendar_prices = calendar_prices.drop_duplicates(subset=['wm_yr_wk'])
    prices_df = prices_df.merge(calendar_prices[['wm_yr_wk', 'month', 'year']], on=['wm_yr_wk'], how='left')
    del calendar_prices
    gc.collect()
    prices_df['price_momentum'] = prices_df['sell_price'] / prices_df.groupby(['store_id', 'item_id'])[
        'sell_price'].transform(lambda x: x.shift(1))
    prices_df['price_momentum_m'] = prices_df['sell_price'] / prices_df.groupby(['store_id', 'item_id', 'month'])[
        'sell_price'].transform('mean')
    prices_df['price_momentum_y'] = prices_df['sell_price'] / prices_df.groupby(['store_id', 'item_id', 'year'])[
        'sell_price'].transform('mean')
    prices_df['sell_price_cent'] = [math.modf(p)[0] for p in prices_df['sell_price']]
    prices_df['price_max_cent'] = [math.modf(p)[0] for p in prices_df['price_max']]
    prices_df['price_min_cent'] = [math.modf(p)[0] for p in prices_df['price_min']]
    del prices_df['month'], prices_df['year']
    prices_df = reduce_mem_usage(prices_df, verbose=False)
    gc.collect()
    original_columns = list(grid_df)
    grid_df = grid_df.merge(prices_df, on=['store_id', 'item_id', 'wm_yr_wk'], how='left')
    del(prices_df)
    gc.collect()
    keep_columns = [col for col in list(grid_df) if col not in original_columns]
    grid_df = grid_df[['id', 'd'] + keep_columns]
    grid_df = reduce_mem_usage(grid_df, verbose=False)
    grid_df.to_feather(f"grid_price_{end_train_day_x}_to_{end_train_day_x + predict_horizon}.feather")
    del(grid_df)
    gc.collect()
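To see what the grouped price features look like, here is a small sketch on a toy table. The store, items, and prices are invented for illustration, and the momentum is computed with the groupby shift method rather than the transform with a lambda used in the function above; the two should give the same result.

```python
import pandas as pd

# Toy weekly prices for two hypothetical items in one store.
prices = pd.DataFrame({
    "store_id": ["CA_1"] * 6,
    "item_id": ["A", "A", "A", "B", "B", "B"],
    "wm_yr_wk": [1, 2, 3, 1, 2, 3],
    "sell_price": [2.0, 2.5, 2.0, 10.0, 10.0, 8.0],
})

keys = ["store_id", "item_id"]

# transform() returns one value per original row, so the result can be
# assigned straight back as a new column.
prices["price_max"] = prices.groupby(keys)["sell_price"].transform("max")

# Price relative to the item's historical maximum (1.0 means "at the peak").
prices["price_norm"] = prices["sell_price"] / prices["price_max"]

# Ratio of this week's price to the previous week's price of the same item;
# the first week of each item has no predecessor and becomes NaN.
prices["price_momentum"] = (
    prices["sell_price"] / prices.groupby(keys)["sell_price"].shift(1)
)

print(prices)
```

A momentum below 1 marks a price cut relative to the previous week, and a value above 1 marks an increase.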

The next function computes the moon phase, returning one of its eight phases (from new moon to waning crescent). Moon phases shouldn’t directly influence sales (weather conditions do, but the data contains no weather information). However, they follow a periodic cycle of about 29.5 days, which can be a good match for periodic shopping behaviors.

There is an interesting discussion, with different hypotheses regarding why moon phases may work as a predictor, in this competition post: https://www.kaggle.com/competitions/m5-forecasting-accuracy/discussion/154776:

def get_moon_phase(d):  # 0=new, 4=full; 4 days/phase
    diff = datetime.datetime.strptime(d, '%Y-%m-%d') - datetime.datetime(2001, 1, 1)
    days = dec(diff.days) + (dec(diff.seconds) / dec(86400))
    lunations = dec("0.20439731") + (days * dec("0.03386319269"))
    phase_index = math.floor((lunations % dec(1) * dec(8)) + dec('0.5'))
    return int(phase_index) & 7
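The following self-contained sketch restates the function together with the imports it appears to need and checks it on two dates. It assumes that dec is an alias for decimal.Decimal, since the excerpt does not show its imports, and the PHASE_NAMES list is added here only to make the output readable.

```python
import datetime
import math
from decimal import Decimal as dec  # assumed alias, not shown in the excerpt

# Conventional names for indices 0..7 (added for illustration).
PHASE_NAMES = [
    "new moon", "waxing crescent", "first quarter", "waxing gibbous",
    "full moon", "waning gibbous", "last quarter", "waning crescent",
]


def get_moon_phase(d):  # 0=new, 4=full
    diff = datetime.datetime.strptime(d, '%Y-%m-%d') - datetime.datetime(2001, 1, 1)
    days = dec(diff.days) + (dec(diff.seconds) / dec(86400))
    # Fraction of a lunar cycle elapsed, offset so that 0 is a new moon.
    lunations = dec("0.20439731") + (days * dec("0.03386319269"))
    # Scale the fractional cycle to 8 bins and round to the nearest bin.
    phase_index = math.floor((lunations % dec(1) * dec(8)) + dec('0.5'))
    return int(phase_index) & 7  # wrap index 8 back to 0


# The per-day increment implies the cycle length in days (roughly 29.53).
cycle_days = 1 / dec("0.03386319269")
print(cycle_days)

for day in ("2001-01-01", "2001-01-09"):
    print(day, get_moon_phase(day), PHASE_NAMES[get_moon_phase(day)])
```

The final bitwise AND with 7 matters because rounding to the nearest bin can yield 8 near the end of a cycle, which is the same phase as 0.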

The moon phase function is used inside a more general function for creating time-based features. That function takes the information in the calendar dataset and adds it to the features. This information includes events and their types, as well as flags for the SNAP periods, which could further drive sales of basic goods. The function also generates numeric features such as the day, the month, the year, the day of the week, the week in the month, and whether the day falls at the end of the week. Here is the code:

def generate_grid_calendar(calendar_df, end_train_day_x, predict_horizon):
    grid_df = pd.read_feather(
        f"grid_df_{end_train_day_x}_to_{end_train_day_x + predict_horizon}.feather")
    grid_df = grid_df[['id', 'd']]
    gc.collect()
    calendar_df['moon'] = calendar_df.date.apply(get_moon_phase)
    # Merge calendar partly
    icols = ['date',
             'd',
             'event_name_1',
             'event_type_1',
             'event_name_2',
             'event_type_2',
             'snap_CA',
             'snap_TX',
             'snap_WI',
             'moon',
             ]
    grid_df = grid_df.merge(calendar_df[icols], on=['d'], how='left')
    icols = ['event_name_1',
             'event_type_1',
             'event_name_2',
             'event_type_2',
             'snap_CA',
             'snap_TX',
             'snap_WI']
    for col in icols:
        grid_df[col] = grid_df[col].astype('category')
    grid_df['date'] = pd.to_datetime(grid_df['date'])
    grid_df['tm_d'] = grid_df['date'].dt.day.astype(np.int8)
    grid_df['tm_w'] = grid_df['date'].dt.isocalendar().week.astype(np.int8)
    grid_df['tm_m'] = grid_df['date'].dt.month.astype(np.int8)
    grid_df['tm_y'] = grid_df['date'].dt.year
    grid_df['tm_y'] = (grid_df['tm_y'] - grid_df['tm_y'].
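Since the listing above is cut off before the remaining features, the sketch below shows one plausible way to derive the day of the week, the week in the month, and an end-of-week flag with pandas. The column names tm_dw, tm_wm, and tm_w_end are hypothetical names following the tm_ prefix, and the definitions (week in the month as the day number divided by 7 and rounded up, end of week as Saturday or Sunday) are assumptions rather than the book's confirmed code.

```python
import numpy as np
import pandas as pd

# A few example dates (chosen only for illustration).
dates = pd.DataFrame({
    "date": pd.to_datetime(["2016-05-01", "2016-05-06", "2016-05-07", "2016-05-31"])
})

dates["tm_d"] = dates["date"].dt.day.astype(np.int8)
dates["tm_m"] = dates["date"].dt.month.astype(np.int8)

# Day of the week, with Monday=0 and Sunday=6 (pandas convention).
dates["tm_dw"] = dates["date"].dt.dayofweek.astype(np.int8)

# Week within the month: days 1-7 -> 1, days 8-14 -> 2, and so on (assumed).
dates["tm_wm"] = np.ceil(dates["tm_d"] / 7).astype(np.int8)

# End-of-week flag: 1 for Saturday and Sunday, 0 otherwise (assumed).
dates["tm_w_end"] = (dates["tm_dw"] >= 5).astype(np.int8)

print(dates)
```

Casting to int8 keeps these columns small, which matters once they are merged onto a grid with one row per item and day.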


